Coriolis: enabling metagenomic classification on lightweight mobile devices

Abstract Motivation The introduction of portable DNA sequencers such as the Oxford Nanopore Technologies MinION has enabled real-time and in the field DNA sequencing. However, in the field sequencing is actionable only when coupled with in the field DNA classification. This poses new challenges for metagenomic software since mobile deployments are typically in remote locations with limited network connectivity and without access to capable computing devices. Results We propose new strategies to enable in the field metagenomic classification on mobile devices. We first introduce a programming model for expressing metagenomic classifiers that decomposes the classification process into well-defined and manageable abstractions. The model simplifies resource management in mobile setups and enables rapid prototyping of classification algorithms. Next, we introduce the compact string B-tree, a practical data structure for indexing text in external storage, and we demonstrate its viability as a strategy to deploy massive DNA databases on memory-constrained devices. Finally, we combine both solutions into Coriolis, a metagenomic classifier designed specifically to operate on lightweight mobile devices. Through experiments with actual MinION metagenomic reads and a portable supercomputer-on-a-chip, we show that compared with the state-of-the-art solutions Coriolis offers higher throughput and lower resource consumption without sacrificing quality of classification. Availability and implementation Source code and test data are available from http://score-group.org/?id=smarten.


Proofs
This section contains the proofs of the theorems from Section 3. For clarity, we restate Algorithms 1 and 2. First we will prove a useful property of the lcp function in Lemma 1.
Proof. The reader can verify that LCP (i), P (ℓ + 1), and C R (i) on lines 4, 5, and 6 are always defined and that the algorithm always terminates. We will proceed with a proof by contradiction. That is, we will assume lcp(P, S r ) ̸ = max{lcp(P, S k ) | 1 ≤ k ≤ |S|} and derive false. Let m = max{lcp(P, .e] if and only if lcp(P, S k ) = m. By our assumption, r / ∈ [b..e], so we have the following two cases.
Case 1: r < b. Since 1 ≤ r < b, we have b ≥ 2, implying that there's some iteration of the algorithm in which i = b. Consider the beginning of this iteration. We will first show that the if statement on line 4 holds. This is clearly the case when b − 1 = j, so consider instead the case when b − 1 ̸ = j. We must have b > 2, since otherwise we'd have b = 2 and j = 1 contrary to our assumption. Because b > 2, there must be some most recent prior iteration where i = i ′ for some 2 ≤ i ′ < b in which ℓ was assigned its current value LCP (i ′ ) on line 5. Observe that we must have LCP (i ′ ) ≥ |P | or C R (i ′ ) ̸ = P (ℓ + 1). To demonstrate this, assume for the purpose of contradiction that LCP (i ′ ) < |P | and C R (i ′ ) = P (ℓ + 1). Then we'd assign i ′ to j on line 7. Since by the definition of i ′ line 7 isn't executed again until at least iteration i = b, we have j = i ′ which implies i ′ < b − 1 by our assumption that j < b − 1. This means there is some iteration after i ′ but before b. But then during the iteration after i ′ , the if condition on line 4 would hold and we'd assign LCP (i ′ + 1) to ℓ, contradicting the definition of i ′ . Thus, we have LCP (i ′ ) ≥ |P | or C R (i ′ ) ̸ = P (ℓ + 1). In the first case where LCP (i ′ ) ≥ |P |, we know from the definition of b that lcp(P, S b−1 ) < lcp(P, S b ). By Lemma 1 we have LCP (b) = lcp(P, S b−1 ). And since lcp(P, S b ) ≤ |P |, we can conclude that LCP (b) ≤ |P |, and thus LCP (b) ≤ ℓ. In the second case where C R (i ′ ) ̸ = P (ℓ+1), observe that by the definition of i ′ , we must have the if condition on line 4 be false for every iteration after i ′ and before b. Thus, for all This shows that the if statement on line 4 holds, so we execute lines 5 and 6. Now we have ℓ = LCP (i). Since lcp(P, S b−1 ) < lcp(P, S b ) ≤ |P |, we have ℓ = lcp(P, S b−1 ) by Lemma 1, so ℓ < |P |. Also, by ℓ < lcp(P, S b ), we have P (ℓ + 1) = S b (ℓ + 1) = C R (b). Thus, the if statement on line 6 holds and we assign b to j. But since j is monotonically increasing and r < b, this means we don't return r, contradicting our assumption that r is returned.
Case 2: e < r. Since S is lexicographically ordered, each of S e , S e+1 , . . . S r share the same prefix of length lcp(S e , S r ). By definition, we have lcp(P, r) < lcp(P, e), which by Lemma 1 implies that lcp(e, r) = lcp(P, r) < lcp(P, e). Thus we have S e (lcp(e, r) + 1) = P (lcp(e, r) + 1) ̸ = S r (lcp(e, r) + 1) since S is prefix free. Let k be the first index in [e + 1..r] such that S e (lcp(S e , S r ) + 1) ̸ = S k (lcp(S e , S r ) + 1). We will show inductively that at the end of every iteration of the for loop where i ∈ [k..r] we maintain that j < k and that ℓ = lcp(e, r). When i = k, we have j < k at the start of the iteration. By definition e < k ≤ r and thus we have LCP (k) = lcp(S e , S r ), meaning the if condition on line 4 is satisfied and we assign lcp(S e , S r ) to ℓ, maintaining the invariant on ℓ. By definition, we have S k (LCP (k)) ̸ = S e (LCP (k)) = P (LCP (k)), so C R (k) ̸ = P (LCP (k)) and we don't execute line 7, maintaining that j < k. Now assume the invariant held for i − 1 ∈ [k..r], and we are now at the beginning of the next iteration where i ∈ [k..r]. If LCP (i) > ℓ then we do nothing, and the invariant is trivially maintained. Since all suffixes in the range share a prefix of length ℓ, the only other possibility is LCP (i) = ℓ. Thus, assigning LCP (i) to ℓ still maintains the invariant. By lexicographic order, we must have S i (ℓ + 1) > S e (ℓ + 1), which implies S i (ℓ + 1) ̸ = P (ℓ + 1). As a result, we again do not execute line 7, and maintain that j < k. Now that we've shown the invariant is maintained, consider the iteration when i = r. Since by our invariant we have j < k at the end of the iteration, and since j is only ever assigned the current value of i, which is strictly increasing, this means that we don't return r, contradicting our assumption.
Since we arrive at a contradiction in either case, we must have lcp(P, S r ) = max{lcp(P, S k ) | 1 ≤ k ≤ |S|}.
Proof. By Theorem 1, we have lcp(P, S j ) = max{lcp(P, S k ) | 1 ≤ k ≤ |S|}. Observe that by our assumptions we have ℓ ′ = lcp(P, S j ) = max{lcp(P, S k ) | 1 ≤ k ≤ |S|} after assigning ℓ ′ on line 2. By the definition of lcp and since S j isn't a prefix of P , we have the following three cases.
Case 1: |P | = ℓ ′ . In this case we execute the while loop on line 4. Since the algorithm decrements j every iteration and terminates when j < 1, the loop always terminates. We will show that before every iteration of the loop we maintain that lcp(P, S j ) = ℓ ′ . By the definition of ℓ ′ this is clearly true before the first iteration. Now assume the invariant holds before the start of an iteration. If the loop doesn't terminate, then we have LCP (j) ≥ ℓ ′ , which implies that lcp(P, S j−1 ) ≥ ℓ ′ by our assumption. And by the definition of ℓ ′ , this means lcp(P, S j−1 ) = ℓ ′ . Since we then decrement j, we then have lcp(P, S j ) = ℓ ′ before the next iteration. Thus, when the loop terminates, we have lcp(P, S j ) = ℓ ′ . Since lcp(P, S j ) = ℓ ′ = |P |, we have that P is a prefix of S j , thus P ≤ S j . By the termination condition, we have either j = 1 or LCP (j) < ℓ ′ . If j = 1, since P ≤ S j , clearly we have j = min{1 ≤ k ≤ |S| + 1 | k > |S| ∨ P ≤ S k }. If LCP (j) < ℓ ′ , then by the fact that S is lexicographically ordered, we have that S j−1 < P , thus again j = min{1 ≤ k ≤ |S| + 1 | k > |S| ∨ P ≤ S k }.
Theorem 3. Assume that S, LCP , C L , and C R reside in internal memory and that each S i ∈ S is represented as a pointer to some string stored in external memory. Given P ∈ Σ * such that no S i ∈ S is a prefix of P and ℓ ≤ max{lcp(P, Proof. The only operation that performs any I/Os is computing ℓ ′ on line 2 by loading characters of S j ℓ + 1..|S j | and P ℓ + 1..|P | . This computation loads at most 2 · P ℓ + 1..|P | characters (for both P and S j ), and since P ℓ + 1.

Coriolis Runtime Characteristics
Here we report additional measurements taken while comparing the different classifiers. Table S1 compares each of Kraken2, Centrifuge, and Coriolis while varying the number of available cores. We note that Centrifuge was unable to run with more that two threads. Table S2 provides statistics reported by sar during the experiments from Section 5.3.
In Table S1, we observe that the speedup for Kraken2 and Centrifuge is negligible, while Coriolis achieves reasonable speedup for two cores, which begins to drop off once reaching four cores. Even with a single core, Coriolis is almost at the speed of Kraken2. Once using two cores, Coriolis is nearly twice as fast as Kraken2 and over four times faster than Centrifuge, and still achieves non-negligible speedup when increasing to four cores. It is worth reiterating that our implementation of Coriolis makes no effort to parallelize the classifier. Instead, simply by expressing Coriolis in our programming model, the SMARTEn runtime automatically parallelizes Coriolis' execution. With this in mind, the scaling of Coriolis is quite impressive, and demonstrates the power of the SMARTEn programming model.
In Table S2, we see the impact of I/O efficiency and CPU utilization on the speed of classification. Our first observation is that for each classifier, the user CPU time (that is, the total execution time spent utilizing CPU in the user-space) remains effectively constant across different DMRs. The small variations present (in the order of seconds) can be easily attributed to noise in the system once considering that the tools ran for multiple hours. The system CPU time (that is, the total execution time spent by the kernel using the CPU) similarly remains relatively fixed. We can see that the system CPU time dominates the user CPU time. This is because the I/O intensity is quite significant, with each classifier needing to load terabytes of data over the course of execution. Because each of these classifiers use memory mapping, the high I/O intensity puts pressure on the kernel to manage the memory-mapped pages. We see that the total amount of data read from disk by Centrifuge is significantly larger than Kraken2 and Coriolis, explaining why Centrifuge is substantially slower. We see that increasing the DMR has a far greater impact on I/O intensity for Kraken2 than it does for Coriolis, explaining why Coriolis is faster than Kraken2 for a high DMR.